Following the Project Data Science Lifecycle as a process structure, utilize the Health & Human Service Data Breach as a central dataset in extrapolating meaningful predictions or findings about Health Care related Data Breaches in the US. In particular, using time series forecasting, attempt to predict number of breaches, the breach types, and locations of breaches. This will involving training and utilizing multiple time series models. Useful time series models will probably operate in monthly frequencies.
import math
import pandas as pd, numpy as np, plotly.express as px, plotly.graph_objs as go, plotly.figure_factory as ff, plotly.io as pio
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm
from sklearn.metrics import mean_absolute_error, mean_squared_error
pio.renderers.default = "plotly_mimetype+notebook"
https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf# Select Archive, then Research Report, and then excel or csv export to get raw data. Direct Link cannot be used due to server side session dynamic obfuscation on client side JS.
Latest Data Retrieval 5/5/2023
Because we are concerned with time series analysis, the data should be read in, sorted chronologically, and eventually indexed via time. We will extract the year separately for some visual analysis.
df = pd.read_csv("breach_report.csv")
df['Breach Submission Date'] = pd.to_datetime(df['Breach Submission Date'], format='%m/%d/%Y')
df['Year of Breach Submission'] = df['Breach Submission Date'].dt.year
df.sort_values(by='Breach Submission Date', ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooke Army Medical Center | TX | Healthcare Provider | 1000.0 | 2009-10-21 | Theft | Paper/Films | No | A binder containing the protected health infor... | 2009 |
| 1 | Mid America Kidney Stone Association, LLC | MO | Healthcare Provider | 1000.0 | 2009-10-28 | Theft | Network Server | No | Five desktop computers containing unencrypted ... | 2009 |
| 2 | Alaska Department of Health and Social Services | AK | Healthcare Provider | 501.0 | 2009-10-30 | Theft | Other, Other Portable Electronic Device | No | The Alaska Department of Health and Social Ser... | 2009 |
| 3 | Health Services for Children with Special Need... | DC | Health Plan | 3800.0 | 2009-11-17 | Loss | Laptop | No | A laptop was lost by an employee while in tran... | 2009 |
| 4 | Michele Del Vicario, MD | CA | Healthcare Provider | 6145.0 | 2009-11-20 | Theft | Desktop Computer | No | A shared Computer that was used for backup was... | 2009 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5356 entries, 0 to 5355 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name of Covered Entity 5356 non-null object 1 State 5341 non-null object 2 Covered Entity Type 5352 non-null object 3 Individuals Affected 5355 non-null float64 4 Breach Submission Date 5356 non-null datetime64[ns] 5 Type of Breach 5355 non-null object 6 Location of Breached Information 5356 non-null object 7 Business Associate Present 5356 non-null object 8 Web Description 4250 non-null object 9 Year of Breach Submission 5356 non-null int32 dtypes: datetime64[ns](1), float64(1), int32(1), object(7) memory usage: 397.6+ KB
The dataset contains records for 5356 healthcare data breaches, has one identifier variable(Name of Covered Entity), 4 Categorical(Covered Entity Type, Type of Breach, Location of Breached Information, and State), 1 boolean(string currently:Business Associate Present), 1 Date(Breach Submission Date), 1 Text (Web Description), and 1 quantitative (Individuals Affected).
df.isnull().sum()
Name of Covered Entity 0 State 15 Covered Entity Type 4 Individuals Affected 1 Breach Submission Date 0 Type of Breach 1 Location of Breached Information 0 Business Associate Present 0 Web Description 1106 Year of Breach Submission 0 dtype: int64
The dataset does not appear to be missing a lot of data at first glance, although Web Descriptions are missing for nontrivial of breach observations. This may be in part due to the breach not being publicly announced via a website. The Web Descriptions may provide opportunities for feature creation as they may contain valuable information. There is also a possible of incorporating them utilizing sentiment analysis although caution should be taken relative to wanting to utilize most of the 5356 records, and losing 20% due to missing web descriptions(a natural occurrence for some breaches) is not desirable. Therefore, if they are used, the absence of one should hopefully result in a meaningful data point as well for a breach observation. This is unlikely though.
df['State'].unique()
array(['TX', 'MO', 'AK', 'DC', 'CA', 'PA', 'TN', 'VA', 'NC', 'MI', 'MA',
'MD', 'ID', 'IL', 'UT', 'AZ', 'RI', 'PR', 'FL', 'NM', 'NY', 'WY',
'WI', 'WA', 'CT', 'AL', 'GA', 'SC', 'KY', 'MN', 'CO', 'NE', 'KS',
'OH', 'NV', 'IN', 'OR', 'IA', 'OK', 'AR', 'MS', 'NH', 'MT', 'NJ',
'WV', nan, 'LA', 'ND', 'HI', 'SD', 'ME', 'VT', 'DE'], dtype=object)
The dataset includes Puerto Rico & DC as states for convenience. Records with NAN states should be addressed in the data preparation phase. With High Dimensionality from encoding, these might be useful for visual analysis, but typically not machine learning unless geospatial analysis is central.
df['Covered Entity Type'].value_counts()
Covered Entity Type Healthcare Provider 3890 Business Associate 766 Health Plan 686 Healthcare Clearing House 10 Name: count, dtype: int64
Healthcare Sector Organizations seem to fall under 4 types of classification on the HITECH Act of 2009, alongside HHS's treatment of the Sector. Healthcare Providers are entities that directly provide healthcare to patients or individuals. Business Associates other entities that far under specific HIPAA and HHS guidelines(https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/business-associates/index.html). For the purpose of this report, they will be referred to as third party vendors to better differentiate them.
These will likely encoded for potential machine learning models and are important for understanding the problem of healthcare data problems. Observing the trends in which entities are breached and which entities affect the most individuals when breached are paramount.
df['Type of Breach'].unique()
array(['Theft', 'Loss', 'Other', 'Unauthorized Access/Disclosure',
'Hacking/IT Incident', nan, 'Loss, Theft', 'Improper Disposal',
'Improper Disposal, Loss', 'Other, Theft', 'Loss, Other',
'Improper Disposal, Loss, Theft', 'Unknown',
'Theft, Unauthorized Access/Disclosure',
'Hacking/IT Incident, Unauthorized Access/Disclosure',
'Other, Unauthorized Access/Disclosure',
'Hacking/IT Incident, Other', 'Other, Unknown',
'Loss, Unauthorized Access/Disclosure, Unknown',
'Loss, Theft, Unauthorized Access/Disclosure, Unknown',
'Hacking/IT Incident, Other, Unauthorized Access/Disclosure',
'Hacking/IT Incident, Theft, Unauthorized Access/Disclosure',
'Improper Disposal, Theft', 'Hacking/IT Incident, Theft',
'Hacking/IT Incident, Improper Disposal, Loss, Other, Theft, Unauthorized Access/Disclosure, Unknown',
'Loss, Other, Theft',
'Other, Theft, Unauthorized Access/Disclosure',
'Improper Disposal, Theft, Unauthorized Access/Disclosure',
'Loss, Unknown', 'Loss, Unauthorized Access/Disclosure',
'Improper Disposal, Unauthorized Access/Disclosure'], dtype=object)
df['Type of Breach'].value_counts()
Type of Breach Hacking/IT Incident 2599 Unauthorized Access/Disclosure 1293 Theft 979 Loss 205 Improper Disposal 108 Other 75 Theft, Unauthorized Access/Disclosure 25 Loss, Theft 15 Unknown 10 Hacking/IT Incident, Unauthorized Access/Disclosure 8 Other, Unauthorized Access/Disclosure 6 Other, Theft 3 Improper Disposal, Loss, Theft 3 Loss, Unauthorized Access/Disclosure 3 Improper Disposal, Loss 3 Hacking/IT Incident, Theft, Unauthorized Access/Disclosure 2 Other, Theft, Unauthorized Access/Disclosure 2 Hacking/IT Incident, Other 2 Other, Unknown 2 Loss, Other 2 Loss, Theft, Unauthorized Access/Disclosure, Unknown 1 Hacking/IT Incident, Other, Unauthorized Access/Disclosure 1 Loss, Unauthorized Access/Disclosure, Unknown 1 Improper Disposal, Theft 1 Hacking/IT Incident, Theft 1 Hacking/IT Incident, Improper Disposal, Loss, Other, Theft, Unauthorized Access/Disclosure, Unknown 1 Loss, Other, Theft 1 Improper Disposal, Theft, Unauthorized Access/Disclosure 1 Loss, Unknown 1 Improper Disposal, Unauthorized Access/Disclosure 1 Name: count, dtype: int64
len(df['Type of Breach'].unique())
31
Type of breach and location of breached information are interesting due to multiple categorical labels being applied to a small proportion of breaches. It's safe to assume that this is because these breaches involved multiple acts leading to the disclosure of data. After calculating the number of unique combinations, there are too many to include. As the independence/dependence of breach type combinations would be difficult to aggregate into their own categorical accurately and honestly, it is perhaps best to simply make new columns called first breach type and first location of breach to reduce the number of labels a more manageable number.
Dataset analysis resulted in some obvious actions to take:
# Simply use "Not Provided" for records without web descriptions.
df['Web Description'].fillna("Not Provided", inplace=True)
df[df['Type of Breach'].isnull()]
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | |
|---|---|---|---|---|---|---|---|---|---|---|
| 54 | Computer Program and Systems, Inc. (CPSI) | AL | Business Associate | 768.0 | 2010-03-30 | NaN | Yes | \N | 2010 |
For the one null Type of Breach value, we are given the location of breached information, but not the type. Email suggests it was at least Hacking/IT Incident or Unauthorized Access/Disclosure. Because it was a third party vendor, and a third party associate was directly involved, it was likely an disclosure.There is also reference to this breach being an "Unauthorized Access/Disclosure" from https://search.r-project.org/CRAN/refmans/Ecdat/html/HHSCyberSecurityBreaches.html.
df.at[54, "Type of Breach"] = "Unauthorized Access/Disclosure"
df[df['Individuals Affected'].isnull()]
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | |
|---|---|---|---|---|---|---|---|---|---|---|
| 802 | Valperaiso Fire Department | IN | Health Plan | NaN | 2013-09-03 | Theft | Desktop Computer | No | This case has been consolidated with another r... | 2013 |
df[df['Individuals Affected'].isnull()]['Web Description']
802 This case has been consolidated with another r... Name: Web Description, dtype: object
This breach seems to be involved with a larger number of breaches across states:
https://www.healthdatamanagement.com/articles/tax-fraud-fueled-breach-hits-36th-fire-department
but this article does mention that it was 860 individuals affected.
df.at[802, "Individuals Affected"] = 860
df[df['Covered Entity Type'].isnull()]
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | |
|---|---|---|---|---|---|---|---|---|---|---|
| 967 | HealthSource of Ohio Inc. | OH | NaN | 8845.0 | 2014-02-26 | Unauthorized Access/Disclosure | Other | Yes | Health Source of Ohio, the covered entity (CE)... | 2014 |
| 1042 | Boston Medical Center | MA | NaN | 15265.0 | 2014-04-29 | Unauthorized Access/Disclosure | Network Server | Yes | Boston Medical Center, the covered entity (CE)... | 2014 |
| 1097 | BioReference Laboratories Inc | NJ | NaN | 3334.0 | 2014-07-23 | Unauthorized Access/Disclosure | Network Server | Yes | The covered entity (CE), BioReference Laborato... | 2014 |
| 2310 | Boys Town National Research Hospital | NE | NaN | 2182.0 | 2018-05-09 | Hacking/IT Incident | Yes | The covered entity (CE), Boys Town National Re... | 2018 |
It is unclear why there are nulls for these covered entities. For simplicity sake, all 4 will be defined based on their own website.
df.at[2310, "Covered Entity Type"] = "Healthcare Provider" # Hospital
df.at[1097, "Covered Entity Type"] = "Healthcare Provider" # Patient Testing Facility
df.at[1042, "Covered Entity Type"] = "Healthcare Provider" # Hospital/Research Center
df.at[967, "Covered Entity Type"] = "Healthcare Provider" # Multiple Healthcare Services Provider
df[df['State'].isnull()]
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | |
|---|---|---|---|---|---|---|---|---|---|---|
| 243 | Departamento de Salud de Puerto Rico | NaN | Healthcare Provider | 2621.0 | 2011-02-22 | Loss | Other Portable Electronic Device | No | The covered entity (CE) reported that on March... | 2011 |
| 1315 | Puerto Rico Department of Heatlh - Medicaid Pr... | NaN | Health Plan | 500.0 | 2015-04-22 | Theft | Other | No | The covered entity (CE) reported that on Febru... | 2015 |
| 2208 | Triple-S Advantage, Inc. | NaN | Health Plan | 36305.0 | 2018-02-02 | Unauthorized Access/Disclosure | Paper/Films | No | Subsequent to Triple-S Management entering int... | 2018 |
| 2675 | Metro Santurce, Inc. d/b/a Hospital Pavia Sant... | NaN | Healthcare Provider | 305737.0 | 2019-04-13 | Hacking/IT Incident | Network Server | No | The covered entities (CEs), Metro Santurce Inc... | 2019 |
| 2706 | Inspira Behavioral Care, Corp | NaN | Healthcare Provider | 4246.0 | 2019-05-02 | Theft | Desktop Computer | No | Inspira Behavioral Care, Corp., the covered en... | 2019 |
| 2716 | Inmediata Health Group, Corp. | NaN | Healthcare Clearing House | 1565338.0 | 2019-05-07 | Unauthorized Access/Disclosure | Network Server | No | Not Provided | 2019 |
| 2746 | Farmacia La Amistad Inc. | NaN | Healthcare Provider | 2500.0 | 2019-05-24 | Hacking/IT Incident | Network Server | No | Farmacia La Amistad, the covered entity (CE), ... | 2019 |
| 2824 | Bayamon Medical Center Corp. | NaN | Healthcare Provider | 422496.0 | 2019-07-19 | Hacking/IT Incident | Network Server | No | Bayamon Medical Center, the covered entity (CE... | 2019 |
| 2826 | Puerto Rico Women And Children's Hospital, LLC | NaN | Healthcare Provider | 99943.0 | 2019-07-19 | Hacking/IT Incident | Network Server | No | Puerto Rico Women and Children’s Hospital, the... | 2019 |
| 2916 | Intramural Practice Plan - Medical Sciences Ca... | NaN | Healthcare Provider | 439753.0 | 2019-09-16 | Hacking/IT Incident | Network Server | No | Intramural Practice Plan - Medical Sciences Ca... | 2019 |
| 3422 | Servicios de Salud Primarios de Barceloneta, I... | NaN | Healthcare Provider | 60200.0 | 2020-09-04 | Hacking/IT Incident | No | The covered entity (CE), Servicios de Salud Pr... | 2020 | |
| 4550 | Laboratorio Clínico Caparros | NaN | Healthcare Provider | 500.0 | 2022-03-06 | Hacking/IT Incident | Network Server | No | The covered entity (CE), Laboratorio Clínico C... | 2022 |
| 4563 | Laboratorio Clinico Toledo | NaN | Healthcare Provider | 500.0 | 2022-03-14 | Hacking/IT Incident | Network Server | No | Not Provided | 2022 |
| 5072 | Doctors' Center Hospital | NaN | Healthcare Provider | 1195220.0 | 2022-11-09 | Hacking/IT Incident | Network Server | No | Not Provided | 2022 |
| 5345 | Modern Cardiology Associates | NaN | Healthcare Provider | 10000.0 | 2023-04-19 | Hacking/IT Incident | Network Server | No | Not Provided | 2023 |
All records with missing states appear to be from Puerto Rico except for Inspira, which is New Jersey.
df.at[2706 , "State"] = "NJ"
df['State'].fillna("PR", inplace=True)
df.isnull().sum()
Name of Covered Entity 0 State 0 Covered Entity Type 0 Individuals Affected 0 Breach Submission Date 0 Type of Breach 0 Location of Breached Information 0 Business Associate Present 0 Web Description 0 Year of Breach Submission 0 dtype: int64
There are now no null values in the dataset
Creating the first breach type column is a simple way of dealing with the multi-value label problem for Breach Type
# split column by ',' and keep only the first column
df['First Breach Type'] = df['Type of Breach'].str.split(',').str[0]
df['First Location of Breach'] = df['Location of Breached Information'].str.split(',').str[0]
From the IBM data report from 2021, we had 499 USD for 2020. For 2009-2019, we were able to use the consolidated table from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349636/. We needed values for 2021-2023. For some reason, the IBM cost of a data breach in 2022 only had the average cost of a data record across sectors, but healthcare records have always been more valuable than the average record. This may be in part to their research disclaimer about the strength of the US dollar making this value questionable for that year:
"This year, a strong U.S. dollar significantly influenced the
global cost analysis. The conversion from local currencies
to the U.S. dollar deflated the per record and average total
cost estimates. For purposes of consistency with prior
years, we decided to continue to use the same accounting
method rather than adjust the cost" (https://www.ibm.com/downloads/cas/OJDVQGRY).
Measuring the actual value outside of a currency is difficult, but the trend is obvious, that healthcare record cost values continue to rise for organizations. Please note, these average costs per records for an healthcare organization are estimations in and of themselves by IBM and the Ponemon Institute. Organizations tend to downplay the costs of a data breach, and do not include indirect costs typically, meaning these values could be under estimations.
Using the table, the values for 2021-2023 were imputed via Exponential Smoothing Forecasting due to the small number of values.
healthcare_record_costs = {
2009: 270,
2010: 294,
2011: 240,
2012: 233,
2013: 296,
2014: 359,
2015: 363,
2016: 355,
2017: 380,
2018: 408,
2019: 429,
2020: 499
}
costs_df = pd.DataFrame(list(healthcare_record_costs.items()), columns=["Year", "Cost"])
costs_df.set_index("Year", inplace=True)
model = ExponentialSmoothing(costs_df, seasonal=None, trend='additive', damped_trend=True)
model_fit = model.fit()
n_missing_years = 3
forecast = model_fit.forecast(n_missing_years)
for i in range(n_missing_years):
year = 2021 + i
value = forecast.iloc[i]
healthcare_record_costs[year] = value
print(healthcare_record_costs)
{2009: 270, 2010: 294, 2011: 240, 2012: 233, 2013: 296, 2014: 359, 2015: 363, 2016: 355, 2017: 380, 2018: 408, 2019: 429, 2020: 499, 2021: 518.9987039305105, 2022: 538.8897530228002, 2023: 558.6645878704182}
d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting. d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`. d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
# floor the floats down to be consistent with the other estimated values
healthcare_record_costs = {year: math.floor(value) for year, value in healthcare_record_costs.items()}
df['Year of Breach Submission']
0 2009
1 2009
2 2009
3 2009
4 2009
...
5351 2023
5352 2023
5353 2023
5354 2023
5355 2023
Name: Year of Breach Submission, Length: 5356, dtype: int32
# Create average cost per healthcare record per year
df['Average Cost Per Record Per Breach Year'] = df['Year of Breach Submission'].apply(lambda x: healthcare_record_costs[x])
df.head()
| Name of Covered Entity | State | Covered Entity Type | Individuals Affected | Breach Submission Date | Type of Breach | Location of Breached Information | Business Associate Present | Web Description | Year of Breach Submission | First Breach Type | First Location of Breach | Average Cost Per Record Per Breach Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooke Army Medical Center | TX | Healthcare Provider | 1000.0 | 2009-10-21 | Theft | Paper/Films | No | A binder containing the protected health infor... | 2009 | Theft | Paper/Films | 270 |
| 1 | Mid America Kidney Stone Association, LLC | MO | Healthcare Provider | 1000.0 | 2009-10-28 | Theft | Network Server | No | Five desktop computers containing unencrypted ... | 2009 | Theft | Network Server | 270 |
| 2 | Alaska Department of Health and Social Services | AK | Healthcare Provider | 501.0 | 2009-10-30 | Theft | Other, Other Portable Electronic Device | No | The Alaska Department of Health and Social Ser... | 2009 | Theft | Other | 270 |
| 3 | Health Services for Children with Special Need... | DC | Health Plan | 3800.0 | 2009-11-17 | Loss | Laptop | No | A laptop was lost by an employee while in tran... | 2009 | Loss | Laptop | 270 |
| 4 | Michele Del Vicario, MD | CA | Healthcare Provider | 6145.0 | 2009-11-20 | Theft | Desktop Computer | No | A shared Computer that was used for backup was... | 2009 | Theft | Desktop Computer | 270 |
We can now estimate the total breach cost for that breach based off Average Cost Per Record Per Breach Year and Individuals Affected with a scaling factor due to per record costs decreasing when breaches are very large.
df["Estimated Total Breach Cost"] = df['Average Cost Per Record Per Breach Year'] * (df['Individuals Affected'] ** 0.9)
df['Estimated Total Breach Cost'].mean()
7984134.376931305
This mean total healthcare breach based off record approaches many estimations for each year in IBM's cost of a data breach reports, but this represents a major amount of bias and uncertainty with any further analysis. Given companies are not forthcoming nor even motivated to produce accurate results when reporting breaches, this may be one of the only ways to estimate the cost without a comprehensive study of quarterly expense reports for the organizations that have this information publicly available.
A presumption is that the most frequent breach type for healthcare would change over the last 15 years due to the increased reliance on tech and rise of digital hacking. Healthcare records were known to be valuable even before the rise of the dark web, and would often be physically burglarized in order to be resold on black markets. Incentives would included in the HITECH Act of 2009 to encourage the usr of electronic medical record and health systems that would help digitize operations and reduce the complex management of paper/film records.
# Count the number of breaches per year, grouped by type
breach_counts = df.groupby(['Year of Breach Submission', 'First Breach Type']).size().reset_index(name='Count')
breach_counts = breach_counts[(breach_counts['Year of Breach Submission'] != 2009) & (breach_counts['Year of Breach Submission'] != 2023) & (breach_counts['First Breach Type'] != 'Unknown')]
# Calculate the total counts per year
total_counts = breach_counts.groupby('Year of Breach Submission')['Count'].sum().reset_index(name='Total')
# Merge breach_counts with total_counts to calculate proportions
breach_counts = breach_counts.merge(total_counts, on='Year of Breach Submission')
breach_counts['Proportion'] = (breach_counts['Count'] / breach_counts['Total']) * 100
# Create a list of formatted percentage strings
formatted_percentages = [f"{round(prop, 1)}%" for prop in breach_counts['Proportion']]
# Create a stacked bar chart using with proportions as inner bar text labels
fig = px.bar(breach_counts, x='Year of Breach Submission', y='Count', barmode='stack', color='First Breach Type', text=formatted_percentages)
# Iterate through the grouped DataFrame and add total count annotations above the bars
for year, grp in breach_counts.groupby('Year of Breach Submission'):
total_count = grp['Count'].sum()
fig.add_annotation(
x=year,
y=total_count,
text=str(total_count),
showarrow=False,
font=dict(size=12),
yshift=5
)
# Set auto-resizing
fig.update_layout(
title=dict(
text='Hackers & Healthcare<br><sub>Number of Breaches per Year by Breach Type (2010-2022)</sub>',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=30, b=0),
width=None,
height=None
)
# Get total counts per year
total_counts = breach_counts.groupby('Year of Breach Submission')['Count'].sum().reset_index(name='Total')
# Add labels for total counts above each bar
annotations = []
for index, row in total_counts.iterrows():
annotations.append(go.layout.Annotation(
x=row['Year of Breach Submission'],
y=row['Total'] + 15,
text=str(row['Total']),
showarrow=False,
font=dict(size=8, color='black')
))
fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)
fig.update_layout(annotations=annotations)
fig.show()
There are several trends that can be deduced from the above annual stacked bar chart concerning healthcare breaches. First, breaches are occurring more frequently each year. Secondly, Cyber Security incidents have been accounting for the majority of data breaches since 2017. This trend will likely continue for years to come and probably is not very surprising. What is distinct about healthcare breaches though, is that these records are valuable enough to physically burglarize. From 2010 to 2014, this accounted for the majority of breaches. This is due to the high price medical records would fetch on black markets.
def format_million(value):
return f"{round(value / 1_000_000, 1)}M"
# Calculate the total number of individuals affected per year, grouped by breach type
affected_counts = df.groupby(['Year of Breach Submission', 'First Breach Type'])['Individuals Affected'].sum().reset_index(name='Affected')
affected_counts = affected_counts[(affected_counts['Year of Breach Submission'] != 2009) & (affected_counts['Year of Breach Submission'] != 2023) & (affected_counts['First Breach Type'] != 'Unknown')]
# Create a stacked bar chart
fig = px.bar(affected_counts, x='Year of Breach Submission', y='Affected', barmode='stack', color='First Breach Type', title='The Hacked & Affected: Individuals Affected per Year by Type (2010-2022)')
# Iterate through the grouped DataFrame and add total affected labels above the bars
for year, grp in affected_counts.groupby('Year of Breach Submission'):
total_count = grp['Affected'].sum()
fig.add_annotation(
x=year,
y=total_count,
text=format_million(total_count),
showarrow=False,
font=dict(size=12),
yshift=5
)
# Add title & subtitle
fig.update_layout(
title=dict(
text='The Hacked & Affected<br><sub>Individuals Affected per Year by Type (2010-2022)</sub>',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=60, b=0),
width=None,
height=None
)
fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)
fig.show()
When broken down into cumulative records lost by in healthcare breaches by type, again, cybersecurity incidents constituted a vast majority of records exposed annually, while physical theft used to be. This is not surprising as digitally stealing millions of records is easier than carrying millions of records physically.It's also obvious that bad actors(state sponsored and otherwise), caught onto to this being the much more effective and safe way of stealing the same records they would attempt to burglarize previously. Why risk several felonies stateside when you can steal the same data from a location where you will never be charged? 2015 also holds the landmark Anthem data breach, which is the largest healthcare data breach on record with about 80 million PII records stolen including social security numbers, health identification numbers, names, and birthdays. Anthem's total settlement and punitive fees to Indiana's State attorney, Health and Human Services, and class action lawsuits filed by victims totaled around 180 million US dollars by 2019. This breach was likely sponsored by Chinese intelligence groups, who have been hoarding the data supposedly, as no fraud acts have been directly linked to the leak. But, this was the breach that convinced hacker groups around the world, that this is the way to steal the identities of American citizens from now on. Complete medical records can fetch upwards of about 1000$ per pop in some cases per a study by Experian (https://www.experian.com/blogs/ask-experian/heres-how-much-your-personal-information-is-selling-for-on-the-dark-web/) on dark web sales. Note that this is the value of the record on the black market, not the cost of exposing a record for an organization.
Anthem's hack in some ways possibly lowered the average cost per healthcare record.
# Filter out years 2009 and 2023 again
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]
# Create time series plot
fig = px.line(filtered_df, x='Year of Breach Submission', y='Average Cost Per Record Per Breach Year', title='Average Cost Per Record per Year', markers=True)
fig.update_layout(
title=dict(
text='Average Cost Per Record Annually(2010-2022)</sub>',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=60, b=0),
width=None,
height=None
)
fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)
fig.show()
Anthem's effect on the data as an outlier can be observed more drastically in the estimated total breach cost.
# Filter out years 2009 and 2023 again
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]
# get average cost per year
average_cost_per_year = filtered_df.groupby('Year of Breach Submission')['Estimated Total Breach Cost'].mean().reset_index()
# Create a time series plot
fig = px.line(average_cost_per_year, x='Year of Breach Submission', y='Estimated Total Breach Cost', title='Average Cost of a Breach per Year', markers=True)
fig.update_layout(
title=dict(
text='Average Cost of a Breach Annually (2010-2022)',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=60, b=0),
width=None,
height=None
)
fig.show()
In many ways, anthem's breach shows the extent in which the estimator for total breach cost utilizing avg cost per record is inadequate, particularly in the case of outliers.
df.iloc[1253]
Name of Covered Entity Anthem Inc. State IN Covered Entity Type Health Plan Individuals Affected 78800000.0 Breach Submission Date 2015-02-13 00:00:00 Type of Breach Hacking/IT Incident Location of Breached Information Network Server Business Associate Present No Web Description Anthem, Inc. has agreed to pay $16 million to ... Year of Breach Submission 2015 First Breach Type Hacking/IT Incident First Location of Breach Network Server Average Cost Per Record Per Breach Year 363 Estimated Total Breach Cost 4642802619.242887 Name: 1253, dtype: object
While Anthem suffered severe costs not observed in not just healthcare data breaches, but any data breaches, it did not cost the company 46 billion dollars. Many breach datasets remove mega breaches from their studies due to the bias they introduce, as the cost to an organization per record lowers the more records lost, and most breaches do not result in 78 million records exposed.
Lastly, to ensure trends in healthcare data breaches are fully observed in this dataset. The location of breach types and type of breach should indicate the type of specific attack usually carried on healthcare providers specifically as covered entity types, and that is ransomware.
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]
# Group by year and covered entity type, and sum the 'Individuals Affected' column
grouped_df = filtered_df.groupby(['Year of Breach Submission', 'Covered Entity Type'])['Individuals Affected'].sum().reset_index()
# Create a bubble chart
fig = px.scatter(grouped_df,
x='Year of Breach Submission',
y='Individuals Affected',
size='Individuals Affected',
color='Covered Entity Type',
title='Cumulative Individuals Affected by Breaches per Year',
labels={'Individuals Affected': 'Total Individuals Affected'},
size_max=60)
fig.update_layout(
title=dict(
text='Healthcare Providers Hit the Most<br><sub>Cumulative Individuals Affected by Breaches per Year per Covered Entity Type</sub>',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=60, b=0),
width=None,
height=None
)
fig.show()
In the above annual bubble chart, once again Anthem muddied the waters as a Health Plan entity type, but it's important to observe that besides that outlier, healthcare providers such as hospitals and care facilities have been targeted the most in recent years, which is particularly troubling given the other issues these healthcare providers have been experiencing with Covid-19 and overwhelmed staffing. Hospitals in particular are often hit with ransomware attacks.
# Create a treemap
fig = px.treemap(df,
path=['Covered Entity Type', 'First Breach Type', 'First Location of Breach'],
values='Individuals Affected',
color='Covered Entity Type',
title='Individuals Affected by Breaches: Entity Type, Breach Type, and Location')
fig.update_layout(
title=dict(
text='Individuals Affected by Breaches<br><sub>Grouped by Entity Type, Breach Type, and Location</sub>',
x=0.5,
y=0.95,
font=dict(size=18)
),
autosize=True,
margin=go.layout.Margin(l=0, r=0, t=60, b=0),
width=None,
height=None
)
fig.show()
This can be observed by the above treemap. One can see that hacking has cost every healthcare entity type the most, with network servers and email based phishing attacks resulting in the breach across the board as well. These are most likely ransomware attacks, which are even more profitable to bad actors than normal breaches and costlier to healthcare organizations. Hackers are able to double dip when they successfully execute ransomware attacks against healthcare organizations. They exfiltrate the encrypted data, and then they can ransom the decryption keys back to providers, who given the nature of their work, are more willing to pay in order to access their saving and provide care to their patients.
For most robust statistical analysis and machine learning, all the categorical variables should be encoded in order to be able to be processed by machine learning algorithms that expect all numerical data types.
df.dtypes
Name of Covered Entity object State object Covered Entity Type object Individuals Affected float64 Breach Submission Date datetime64[ns] Type of Breach object Location of Breached Information object Business Associate Present object Web Description object Year of Breach Submission int32 First Breach Type object First Location of Breach object Average Cost Per Record Per Breach Year int64 Estimated Total Breach Cost float64 dtype: object
df['Business Associate Present'] = df['Business Associate Present'].replace({'Yes':1, 'No' :0}).astype(int)
df = pd.concat([df, pd.get_dummies(df['Covered Entity Type'], prefix="Entity Type").astype(int)], axis =1)
df = pd.concat([df, pd.get_dummies(df['First Breach Type'], prefix="First Breach Type").astype(int)], axis =1)
df.columns
Index(['Name of Covered Entity', 'State', 'Covered Entity Type',
'Individuals Affected', 'Breach Submission Date', 'Type of Breach',
'Location of Breached Information', 'Business Associate Present',
'Web Description', 'Year of Breach Submission', 'First Breach Type',
'First Location of Breach', 'Average Cost Per Record Per Breach Year',
'Estimated Total Breach Cost', 'Entity Type_Business Associate',
'Entity Type_Health Plan', 'Entity Type_Healthcare Clearing House',
'Entity Type_Healthcare Provider',
'First Breach Type_Hacking/IT Incident',
'First Breach Type_Improper Disposal', 'First Breach Type_Loss',
'First Breach Type_Other', 'First Breach Type_Theft',
'First Breach Type_Unauthorized Access/Disclosure',
'First Breach Type_Unknown'],
dtype='object')
We can now generate a correlation heatmap to view potential relationships among relevant numerical types.
select_cols = ['Individuals Affected', 'Business Associate Present','Average Cost Per Record Per Breach Year', 'Estimated Total Breach Cost',
'Entity Type_Health Plan', 'Entity Type_Healthcare Clearing House', 'Entity Type_Healthcare Provider',
'First Breach Type_Hacking/IT Incident', 'First Breach Type_Improper Disposal', 'First Breach Type_Loss',
'First Breach Type_Other', 'First Breach Type_Theft', 'First Breach Type_Unauthorized Access/Disclosure',
'First Breach Type_Unknown']
df_selected = df[select_cols].copy()
corr_matrix = df_selected.corr(method="pearson")
fig = ff.create_annotated_heatmap(
z=corr_matrix.values,
x=list(corr_matrix.columns),
y=list(corr_matrix.index),
annotation_text=corr_matrix.round(2).values,
showscale=True,
colorscale='Viridis',
)
fig.update_layout(
title='Correlation Heatmap',
xaxis=dict(title='Columns', side='bottom', tickvals=[], ticktext=[]),
yaxis=dict(title='Columns', tickvals=[], ticktext=[]),
autosize=True
)
From the correlation heatmap, it was determined that there was very little meaningful correlation between variables that would indicate strong linear predictive ability in the dataset. Estimated Total Breach Cost was generated using individuals affected, so naturally the correlation (0.98) would be high. Although Estimated Total Cost would be very enlightening to predict, it was calculated in a very biased way, and will probably not be utilized for predictive tasks. Hacking and the average cost per record being decently correlated (0.51) is more of a consequence of their similar positive relationships with time. There is correlation between some of the categorical labels that have been encoded, but they were encoded from the same categorical, so this is also not too helpful.
If multi-variable regression were to be utilized, a non-linear regressor would probably be best in hopes that there are some non-linear relationships that could be captured. For time series forecasting, it is possible that univariate methods might perform more accurately. The forecasting target variable(s) are essentially the numbers of different kinds of breaches. Essentially, time series datasets need partitioned from the existing set for each kind.
For time series forecasting, we need to partition and create multiple dataframes according to time frame and frequency. We will initially try monthly and cut out May 2023 for being incomplete.
df_time_indexed = df[df['Breach Submission Date'] <= '2023-04-30'].copy().set_index('Breach Submission Date')
df_time_indexed_cyber = df[(df['First Breach Type'] == "Hacking/IT Incident") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
df_time_indexed_network = df[(df['First Location of Breach'] == "Network Server") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
df_time_indexed_email = df[(df['First Location of Breach'] == "Email") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
monthly_breaches = df_time_indexed.resample('M').size()
monthly_breaches_cyber = df_time_indexed_cyber.resample('M').size()
monthly_breaches_network = df_time_indexed_network.resample('M').size()
monthly_breaches_email = df_time_indexed_email.resample('M').size()
Will want the dataframes eventually for easier splitting.
monthly_breaches_df = monthly_breaches.to_frame(name='Number of Breaches Per Month')
monthly_breaches_cyber_df = monthly_breaches_cyber.to_frame(name='Number of Cyber Breaches Per Month')
monthly_breaches_network_df = monthly_breaches_network.to_frame(name='Number of Network Breaches Per Month')
monthly_breaches_email_df = monthly_breaches_email.to_frame(name='Number of Email Breaches Per Month')
monthly_breaches_df.rename(columns={'Breach Submission Date': 'Month'}, inplace=True)
monthly_breaches_cyber_df.rename(columns={'Breach Submission Date': 'Month'}, inplace=True)
monthly_breaches_network_df.rename(columns={'Breach Submission Date': 'Month'}, inplace=True)
monthly_breaches_email_df.rename(columns={'Breach Submission Date': 'Month'}, inplace=True)
After partitioning the dataset into different series of interest, the series can be examined (seasonality, trends, and autocorrelation) in order make sure they are stationary before modeling. Augmented Dickey–Fuller tests will be performed to verify
decomposition = seasonal_decompose(monthly_breaches, model='additive')
decomposition_cyber = seasonal_decompose(monthly_breaches_cyber, model='additive')
decomposition_network = seasonal_decompose(monthly_breaches_network, model='additive')
decomposition_email = seasonal_decompose(monthly_breaches_email, model='additive')
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches.index,
y=decomposition.seasonal,
mode='lines',
name='Seasonal Component'))
fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Breaches',
xaxis_title='Month',
yaxis_title='Seasonal Component')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_cyber.index,
y=decomposition_cyber.seasonal,
mode='lines',
name='Seasonal Component'))
fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Cyber Breaches',
xaxis_title='Month',
yaxis_title='Seasonal Component')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_network.index,
y=decomposition_network.seasonal,
mode='lines',
name='Seasonal Component'))
fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Network Breaches',
xaxis_title='Month',
yaxis_title='Seasonal Component')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_email.index,
y=decomposition_email.seasonal,
mode='lines',
name='Seasonal Component'))
fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Email Breaches',
xaxis_title='Month',
yaxis_title='Seasonal Component')
fig.show()
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches.rolling(window=12).mean()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches.index, y=monthly_breaches, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))
fig.update_layout(title='Trend Analysis All Breaches', xaxis_title='Time', yaxis_title='Number of Breaches')
fig.show()
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_cyber.rolling(window=12).mean()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_cyber.index, y=monthly_breaches_cyber, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))
fig.update_layout(title='Trend Analysis Cyber', xaxis_title='Time', yaxis_title='Number of Cyber Breaches')
fig.show()
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_network.rolling(window=12).mean()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_network.index, y=monthly_breaches_network, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))
fig.update_layout(title='Trend Analysis Network', xaxis_title='Time', yaxis_title='Number of Network Breaches')
fig.show()
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_email.rolling(window=12).mean()
fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_email.index, y=monthly_breaches_email, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))
fig.update_layout(title='Trend Analysis Email', xaxis_title='Time', yaxis_title='Number of Email Breaches')
fig.show()
lags = range(1, 25)
autocorrelation = [monthly_breaches_df['Number of Breaches Per Month'].autocorr(lag=lag) for lag in lags]
autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})
fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))
fig.update_layout(title='Autocorrelation Analysis For All Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
lags = range(1, 25)
autocorrelation = [monthly_breaches_cyber_df['Number of Cyber Breaches Per Month'].autocorr(lag=lag) for lag in lags]
autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})
fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))
fig.update_layout(title='Autocorrelation Analysis For Cyber Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
lags = range(1, 25)
autocorrelation = [monthly_breaches_network_df['Number of Network Breaches Per Month'].autocorr(lag=lag) for lag in lags]
autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})
fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))
fig.update_layout(title='Autocorrelation Analysis For Network Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
lags = range(1, 25)
autocorrelation = [monthly_breaches_email_df['Number of Email Breaches Per Month'].autocorr(lag=lag) for lag in lags]
autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})
fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))
fig.update_layout(title='Autocorrelation Analysis For Email Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
There is a linear trend that can be observed from the trend graphs, which is not unexpected from the simple exploratory analysis. First order differencing may be all that is required to make the series stationary.
def adf_test(series):
result = adfuller(series)
print('ADF Statistic: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
print('\t{}: {}'.format(key, value))
if result[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
adf_test(monthly_breaches)
ADF Statistic: -1.4103206432385045 p-value: 0.5773183932548773 Critical Values: 1%: -3.472703119504854 5%: -2.880131672353732 10%: -2.5766826861130268 The time series is non-stationary.
Because the time series is non stationary, we should apply seasonal differencing to transform it, but we should also split the data.
For training, we'll use all data up until and through 2020 and then 2021, 2022, & 2023 for validation.
train_total = monthly_breaches[monthly_breaches.index.year <= 2020]
val_total = monthly_breaches[monthly_breaches.index.year >= 2021]
train_cyber = monthly_breaches_cyber[monthly_breaches_cyber.index.year <= 2020]
val_cyber = monthly_breaches_cyber[monthly_breaches_cyber.index.year >= 2021]
train_network = monthly_breaches_network[monthly_breaches_network.index.year <= 2020]
val_network = monthly_breaches_network[monthly_breaches_network.index.year >= 2021]
train_email = monthly_breaches_email[monthly_breaches_email.index.year <= 2020]
val_email = monthly_breaches_email[monthly_breaches_email.index.year >= 2021]
total_differenced_series_train = train_total - train_total.shift(1)
total_differenced_series_train.dropna(inplace=True)
adf_test(total_differenced_series_train)
ADF Statistic: -4.405254874250095 p-value: 0.0002905404844863888 Critical Values: 1%: -3.4851223522012855 5%: -2.88553750045158 10%: -2.5795685622144586 The time series is stationary.
total_differenced_series_val = val_total - val_total.shift(1)
total_differenced_series_val.dropna(inplace=True)
adf_test(total_differenced_series_val)
ADF Statistic: -5.801423150268276 p-value: 4.621409050534281e-07 Critical Values: 1%: -3.7238633119999998 5%: -2.98648896 10%: -2.6328004 The time series is stationary.
def difference_series(series, lag=1):
differenced_series = series - series.shift(lag)
differenced_series.dropna(inplace=True)
return differenced_series
cyber_differenced_series_train = difference_series(train_cyber)
cyber_differenced_series_val = difference_series(val_cyber)
network_differenced_series_train = difference_series(train_network)
network_differenced_series_val = difference_series(val_network)
email_differenced_series_train = difference_series(train_email)
email_differenced_series_val = difference_series(val_email)
adf_test(cyber_differenced_series_train)
adf_test(network_differenced_series_train)
adf_test(email_differenced_series_train)
adf_test(cyber_differenced_series_val)
adf_test(network_differenced_series_val)
adf_test(email_differenced_series_val)
ADF Statistic: -2.908816604131042 p-value: 0.044332310506121776 Critical Values: 1%: -3.4885349695076844 5%: -2.887019521656941 10%: -2.5803597920604915 The time series is stationary. ADF Statistic: -3.3429397262549263 p-value: 0.013064831155840933 Critical Values: 1%: -3.485585145896754 5%: -2.885738566292665 10%: -2.5796759080663887 The time series is stationary. ADF Statistic: -3.3384710601041245 p-value: 0.013243988196147794 Critical Values: 1%: -3.4865346059036564 5%: -2.8861509858476264 10%: -2.579896092790057 The time series is stationary. ADF Statistic: -5.457767771580996 p-value: 2.558808583928109e-06 Critical Values: 1%: -3.7238633119999998 5%: -2.98648896 10%: -2.6328004 The time series is stationary. ADF Statistic: -5.04763693583765 p-value: 1.784751070709593e-05 Critical Values: 1%: -3.7238633119999998 5%: -2.98648896 10%: -2.6328004 The time series is stationary. ADF Statistic: -2.3847732934582515 p-value: 0.1460802333957621 Critical Values: 1%: -3.859073285322359 5%: -3.0420456927297668 10%: -2.6609064197530863 The time series is non-stationary.
Now that the different relevant time series is fully preprocessed, we can move onto model selection and validation.
Due to each time series only have about 160 monthly data points. It makes more sense to use more traditional time series forecasting models instead of a deep learning model, which would be better suited for much larger dataset and finer grained time series data such as SARIMA or Exponential Smoothing.
def fit_arima_and_forecast_validate(train, val):
model = pm.auto_arima(train)
pred_train = model.predict_in_sample()
pred_val = model.predict(n_periods=len(val))
return model, pred_train, pred_val
def fit_exponential_smoothing_and_forecast_validate(train, val, seasonal_periods=12):
model = ExponentialSmoothing(train, seasonal_periods=seasonal_periods, seasonal='add', trend='add').fit()
pred_train = model.fittedvalues
pred_val = model.forecast(steps=len(val))
return model, pred_train, pred_val
Do training and validation operations
model_total, pred_total_train, pred_total_val = fit_arima_and_forecast_validate(train_total, val_total)
model_cyber, pred_cyber_train, pred_cyber_val = fit_arima_and_forecast_validate(train_cyber, val_cyber)
model_network, pred_network_train, pred_network_val = fit_arima_and_forecast_validate(train_network, val_network)
model_email, pred_email_train, pred_email_val = fit_arima_and_forecast_validate(train_email, val_email)
model_total_es, pred_total_es_train, pred_total_es_val = fit_exponential_smoothing_and_forecast_validate(train_total, val_total)
model_cyber_es, pred_cyber_es_train, pred_cyber_es_val = fit_exponential_smoothing_and_forecast_validate(train_cyber, val_cyber)
model_network_es, pred_network_es_train, pred_network_es_val = fit_exponential_smoothing_and_forecast_validate(train_network, val_network)
model_email_es, pred_email_es_train, pred_email_es_val = fit_exponential_smoothing_and_forecast_validate(train_email, val_email)
def plot_actual_vs_predicted(train, val, pred_train, pred_val, title):
# Reverse the differencing transformation
train_reversed = train + train.iloc[0]
val_reversed = val + val.iloc[0]
pred_train_reversed = pred_train + train.iloc[0]
pred_val_reversed = pred_val + val.iloc[0]
fig = go.Figure()
# Add training data (actual and predicted)
fig.add_trace(go.Scatter(x=train_reversed.index, y=train_reversed, mode='markers', name='Actual Train'))
fig.add_trace(go.Scatter(x=train_reversed.index, y=pred_train_reversed, mode='lines', name='Predicted Train'))
# Add validation data (actual and predicted)
fig.add_trace(go.Scatter(x=val_reversed.index, y=val_reversed, mode='markers', name='Actual Validation'))
fig.add_trace(go.Scatter(x=val_reversed.index, y=pred_val_reversed, mode='lines', name='Predicted Validation'))
fig.update_layout(title=title, xaxis_title='Date', yaxis_title='Value', autosize=True)
fig.show()
# Plot the actual vs predicted values for SARIMA models
plot_actual_vs_predicted(train_total, val_total, pred_total_train, pred_total_val, title='ARIMA - Total')
plot_actual_vs_predicted(train_cyber, val_cyber, pred_cyber_train, pred_cyber_val, title='ARIMA - Cyber')
plot_actual_vs_predicted(train_network, val_network, pred_network_train, pred_network_val, title='ARIMA - Network')
plot_actual_vs_predicted(train_email, val_email, pred_email_train, pred_email_val, title='ARIMA - Email')
# Plot the actual vs predicted values for Exponential Smoothing models
plot_actual_vs_predicted(train_total, val_total, pred_total_es_train, pred_total_es_val, title='Exponential Smoothing - Total')
plot_actual_vs_predicted(train_cyber, val_cyber, pred_cyber_es_train, pred_cyber_es_val, title='Exponential Smoothing - Cyber')
plot_actual_vs_predicted(train_network, val_network, pred_network_es_train, pred_network_es_val, title='Exponential Smoothing - Network')
plot_actual_vs_predicted(train_email, val_email, pred_email_es_train, pred_email_es_val, title='Exponential Smoothing - Email')
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def calculate_accuracy_metrics(train, val, pred_train, pred_val, title):
# Reverse the differencing transformation
train_reversed = train + train.iloc[0]
val_reversed = val + val.iloc[0]
pred_train_reversed = pred_train + train.iloc[0]
pred_val_reversed = pred_val + val.iloc[0]
mae_train = mean_absolute_error(train_reversed, pred_train_reversed)
mse_train = mean_squared_error(train_reversed, pred_train_reversed)
rmse_train = np.sqrt(mse_train)
mape_train = mean_absolute_percentage_error(train_reversed, pred_train_reversed)
mae_val = mean_absolute_error(val_reversed, pred_val_reversed)
mse_val = mean_squared_error(val_reversed, pred_val_reversed)
rmse_val = np.sqrt(mse_val)
mape_val = mean_absolute_percentage_error(val_reversed, pred_val_reversed)
print(f'{title} Training\nMAE: {mae_train:.2f}, MSE: {mse_train:.2f}, RMSE: {rmse_train:.2f}, MAPE: {mape_train:.2f}')
print(f'{title} Validation\nMAE: {mae_val:.2f}, MSE: {mse_val:.2f}, RMSE: {rmse_val:.2f}, MAPE: {mape_val:.2f}')
# print metrics for SARIMA models
calculate_accuracy_metrics(train_total, val_total, pred_total_train, pred_total_val, title='ARIMA - Total')
calculate_accuracy_metrics(train_cyber, val_cyber, pred_cyber_train, pred_cyber_val, title='ARIMA - Cyber')
calculate_accuracy_metrics(train_network, val_network, pred_network_train, pred_network_val, title='ARIMA - Network')
calculate_accuracy_metrics(train_email, val_email, pred_email_train, pred_email_val, title='ARIMA - Email')
ARIMA - Total Training MAE: 5.87, MSE: 66.53, RMSE: 8.16, MAPE: 21.28 ARIMA - Total Validation MAE: 14.32, MSE: 325.55, RMSE: 18.04, MAPE: 16.84 ARIMA - Cyber Training MAE: 4.07, MSE: 57.16, RMSE: 7.56, MAPE: 76.26 ARIMA - Cyber Validation MAE: 11.14, MSE: 191.71, RMSE: 13.85, MAPE: 18.92 ARIMA - Network Training MAE: 2.98, MSE: 37.53, RMSE: 6.13, MAPE: 66.61 ARIMA - Network Validation MAE: 9.46, MSE: 141.42, RMSE: 11.89, MAPE: 20.64 ARIMA - Email Training MAE: 2.11, MSE: 10.46, RMSE: 3.23, MAPE: 49.58 ARIMA - Email Validation MAE: 8.54, MSE: 95.53, RMSE: 9.77, MAPE: 31.75
# print metrics for Exponential Smoothing Models
calculate_accuracy_metrics(train_total, val_total, pred_total_es_train, pred_total_es_val, title='Exponential Smoothing - Total')
calculate_accuracy_metrics(train_cyber, val_cyber, pred_cyber_es_train, pred_cyber_es_val, title='Exponential Smoothing - Cyber')
calculate_accuracy_metrics(train_network, val_network, pred_network_es_train, pred_network_es_val, title='Exponential Smoothing - Network')
calculate_accuracy_metrics(train_email, val_email, pred_email_es_train, pred_email_es_val, title='Exponential Smoothing - Email')
Exponential Smoothing - Total Training MAE: 5.81, MSE: 69.81, RMSE: 8.36, MAPE: 20.67 Exponential Smoothing - Total Validation MAE: 14.12, MSE: 301.53, RMSE: 17.36, MAPE: 16.60 Exponential Smoothing - Cyber Training MAE: 4.11, MSE: 57.29, RMSE: 7.57, MAPE: 59.60 Exponential Smoothing - Cyber Validation MAE: 18.68, MSE: 530.12, RMSE: 23.02, MAPE: 31.01 Exponential Smoothing - Network Training MAE: 3.20, MSE: 38.07, RMSE: 6.17, MAPE: 66.11 Exponential Smoothing - Network Validation MAE: 8.63, MSE: 100.25, RMSE: 10.01, MAPE: 20.93 Exponential Smoothing - Email Training MAE: 2.11, MSE: 10.34, RMSE: 3.22, MAPE: 41.64 Exponential Smoothing - Email Validation MAE: 12.69, MSE: 196.98, RMSE: 14.03, MAPE: 46.51
models = ['ARIMA Training', 'Exponential Smoothing Training']
mae = [5.87, 5.81]
mse = [66.53, 69.81]
rmse = [8.16, 8.36]
mape = [21.28, 20.67]
trace1 = go.Bar(x=models, y=mae, name='MAE', text=mae, textposition='auto')
trace2 = go.Bar(x=models, y=mse, name='MSE', text=mse, textposition='auto')
trace3 = go.Bar(x=models, y=rmse, name='RMSE', text=rmse, textposition='auto')
trace4 = go.Bar(x=models, y=mape, name='MAPE', text=mape, textposition='auto')
layout = go.Layout(
title='Training Total Breaches Model Comparison',
barmode='group'
)
fig = go.Figure(data=[trace1, trace2, trace3, trace4], layout=layout)
fig.show()
models = ['ARIMA Validation', 'Exponential Smoothing Validation']
mae_val = [16.25, 16.11]
mse_val = [485.49, 469.57]
rmse_val = [22.03, 21.67]
mape_val = [22.65, 22.55]
trace1 = go.Bar(x=models, y=mae_val, name='MAE', text=mae_val, textposition='auto')
trace2 = go.Bar(x=models, y=mse_val, name='MSE', text=mse_val, textposition='auto')
trace3 = go.Bar(x=models, y=rmse_val, name='RMSE', text=rmse_val, textposition='auto')
trace4 = go.Bar(x=models, y=mape_val, name='MAPE', text=mape_val, textposition='auto')
layout_val = go.Layout(
title='Model Comparison for Validation',
barmode='group'
)
fig_val = go.Figure(data=[trace1, trace2, trace3, trace4], layout=layout_val)
fig_val.show()
Ultimately, I am selecting Exponential Smoothing having slightly better validation RMSE scores. I am not concerned with precise accuracy but generally close counts, but these models behaved so closely that their performance differences might be negligible.
Define pipeline functions
def exponential_smoothing_forecast_pipeline(entire_time_series, desired_periods = 20):
differenced_series = difference_series(entire_time_series) # Apply similar transformation developed in test
model = ExponentialSmoothing(differenced_series, seasonal_periods=12, seasonal='add', trend='add').fit()
predictions = model.forecast(steps=desired_periods)
# Reversing the differencing but this time according to the number of forecast periods
for i in range(desired_periods):
if i < len(entire_time_series):
predictions.iloc[i] += entire_time_series.iloc[-(i+1)]
else:
predictions.iloc[i] += predictions.iloc[i - len(entire_time_series)]
# Rounding down to the nearest integer to avoid counting partial breaches
predictions = np.floor(predictions)
return predictions
total_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches)
cyber_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_cyber)
network_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_network)
email_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_email)
total_breaches_forecasted_through_to_2025
2023-05-31 41.0 2023-06-30 63.0 2023-07-31 48.0 2023-08-31 34.0 2023-09-30 50.0 2023-10-31 53.0 2023-11-30 72.0 2023-12-31 67.0 2024-01-31 50.0 2024-02-29 68.0 2024-03-31 79.0 2024-04-30 74.0 2024-05-31 54.0 2024-06-30 46.0 2024-07-31 50.0 2024-08-31 46.0 2024-09-30 64.0 2024-10-31 69.0 2024-11-30 57.0 2024-12-31 46.0 Freq: M, dtype: float64
cyber_breaches_forecasted_through_to_2025
2023-05-31 30.0 2023-06-30 48.0 2023-07-31 36.0 2023-08-31 19.0 2023-09-30 37.0 2023-10-31 40.0 2023-11-30 48.0 2023-12-31 58.0 2024-01-31 36.0 2024-02-29 59.0 2024-03-31 65.0 2024-04-30 54.0 2024-05-31 41.0 2024-06-30 43.0 2024-07-31 43.0 2024-08-31 37.0 2024-09-30 54.0 2024-10-31 52.0 2024-11-30 44.0 2024-12-31 35.0 Freq: M, dtype: float64
network_breaches_forecasted_through_to_2025
2023-05-31 27.0 2023-06-30 42.0 2023-07-31 28.0 2023-08-31 19.0 2023-09-30 28.0 2023-10-31 35.0 2023-11-30 46.0 2023-12-31 46.0 2024-01-31 28.0 2024-02-29 38.0 2024-03-31 42.0 2024-04-30 25.0 2024-05-31 25.0 2024-06-30 33.0 2024-07-31 26.0 2024-08-31 26.0 2024-09-30 39.0 2024-10-31 40.0 2024-11-30 27.0 2024-12-31 22.0 Freq: M, dtype: float64
email_breaches_forecasted_through_to_2025
2023-05-31 6.0 2023-06-30 13.0 2023-07-31 10.0 2023-08-31 6.0 2023-09-30 12.0 2023-10-31 11.0 2023-11-30 14.0 2023-12-31 8.0 2024-01-31 12.0 2024-02-29 15.0 2024-03-31 11.0 2024-04-30 24.0 2024-05-31 12.0 2024-06-30 10.0 2024-07-31 18.0 2024-08-31 12.0 2024-09-30 16.0 2024-10-31 16.0 2024-11-30 18.0 2024-12-31 13.0 Freq: M, dtype: float64
# Create traces
trace_total = go.Scatter(
x = total_breaches_forecasted_through_to_2025.index,
y = total_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Total Breaches'
)
trace_cyber = go.Scatter(
x = cyber_breaches_forecasted_through_to_2025.index,
y = cyber_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Cyber Breaches'
)
trace_network = go.Scatter(
x = network_breaches_forecasted_through_to_2025.index,
y = network_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Network Breaches'
)
trace_email = go.Scatter(
x = email_breaches_forecasted_through_to_2025.index,
y = email_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Email Breaches'
)
data = [trace_total, trace_cyber, trace_network, trace_email]
# Define layout
layout = go.Layout(
title = 'Breach Type Forecast Comparison (2023-2025)',
xaxis = dict(title = 'Date'),
yaxis = dict(title = 'Number of Breaches')
)
fig = go.Figure(data=data, layout=layout)
fig.show()
The models predict that breaches will continue to rise, although not dramatically so, although cyber breaches seem to be overestimated.
When applying the estimated total breach cost of the historical dataset to the forecasted counts, one that might be conservative, breaches may cost the healthcare industry 9 billion dollars in the next year and a half.
forecast_total_breach_cost_estimation = total_breaches_forecasted_through_to_2025.sum() * df['Estimated Total Breach Cost'].mean()
print(f'${forecast_total_breach_cost_estimation:,.2f}')
$9,030,055,980.31
To examine the whole story, the historical data and forecasted data was plotted together.
trace_total_hist = go.Scatter(
x = monthly_breaches.index,
y = monthly_breaches,
mode = 'lines',
name = 'Total Breaches - Historical',
line = dict(color='darkblue')
)
trace_cyber_hist = go.Scatter(
x = monthly_breaches_cyber.index,
y = monthly_breaches_cyber,
mode = 'lines',
name = 'Cyber Breaches - Historical',
line = dict(color='darkred')
)
trace_network_hist = go.Scatter(
x = monthly_breaches_network.index,
y = monthly_breaches_network,
mode = 'lines',
name = 'Network Breaches - Historical',
line = dict(color='darkgreen')
)
trace_email_hist = go.Scatter(
x = monthly_breaches_email.index,
y = monthly_breaches_email,
mode = 'lines',
name = 'Email Breaches - Historical',
line = dict(color='darkorange')
)
# Create traces for forecasted data
trace_total_forecast = go.Scatter(
x = total_breaches_forecasted_through_to_2025.index,
y = total_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Total Breaches - Forecasted',
line = dict(color='lightblue')
)
trace_cyber_forecast = go.Scatter(
x = cyber_breaches_forecasted_through_to_2025.index,
y = cyber_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Cyber Breaches - Forecasted',
line = dict(color='salmon')
)
trace_network_forecast = go.Scatter(
x = network_breaches_forecasted_through_to_2025.index,
y = network_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Network Breaches - Forecasted',
line = dict(color='lightgreen')
)
trace_email_forecast = go.Scatter(
x = email_breaches_forecasted_through_to_2025.index,
y = email_breaches_forecasted_through_to_2025,
mode = 'lines',
name = 'Email Breaches - Forecasted',
line = dict(color='orange')
)
data = [trace_total_hist, trace_cyber_hist, trace_network_hist, trace_email_hist,
trace_total_forecast, trace_cyber_forecast, trace_network_forecast, trace_email_forecast]
layout = go.Layout(
title = 'Breach Type Comparison (Historical(dark) and Forecasted(light))',
xaxis = dict(title = 'Date'),
yaxis = dict(title = 'Number of Breaches')
)
fig = go.Figure(data=data, layout=layout)
fig.show()
When looking at 2018-2025, we can see the historical cutoff just before the middle of 2023 where we'd expect it today. The lighter colors represent forecasted values relative to their darker historical counterparts. Rewinding to the last quarter of 2020, we can see the breaches carried out while healthcare was dealing with the shock of Covid. It was the perfect storm for healthcare providers, in that remote work became necessary very quickly for as much staff as possible, creating holes in their cybersecurity, as well as the overall influx of work and patients due to Covid. This is pertinent to the modeling in that this noise could have heavily spiked future predictions, but the models did not seem to be dramatically affected by this. Unless something as impactful as Covid occurs, another spike should not occur. Yet, breaches should still be expected to become more frequent with cyber breaches accounting for a majority of them.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349636/ - Previous Forecasting Attempts https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf - Breach Dataset https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/business-associates/index.html - Covered Entity Types Table https://www.ibm.com/downloads/cas/3R8N1DZJ - Utilized to assist with imputing average cost per record https://search.r-project.org/CRAN/refmans/Ecdat/html/HHSCyberSecurityBreaches.html - used to fill in missing missing breach type for one record https://www.healthdatamanagement.com/articles/tax-fraud-fueled-breach-hits-36th-fire-department - used to fill in missing Individuals Affected for one record. https://www.ibm.com/downloads/cas/OJDVQGRY - Source for US dollar influence on record cost https://www.experian.com/blogs/ask-experian/heres-how-much-your-personal-information-is-selling-for-on-the-dark-web/ - Used to reinforce significant value of healthcare records on the dark web